114 research outputs found

    A Grammar for Reproducible and Painless Extract-Transform-Load Operations on Medium Data

    Many interesting data sets available on the Internet are of a medium size: too big to fit into a personal computer's memory, but not so large that they won't fit comfortably on its hard disk. In the coming years, data sets of this magnitude will inform vital research in a wide array of application domains. However, due to a variety of constraints they are cumbersome to ingest, wrangle, analyze, and share in a reproducible fashion. These obstructions hamper thorough peer-review and thus disrupt the forward progress of science. We propose a predictable and pipeable framework for R (the state-of-the-art statistical computing environment) that leverages SQL (the venerable database architecture and query language) to make reproducible research on medium data a painless reality. Comment: 30 pages, plus supplementary material.

    Lessons from Between the White Lines for Isolated Data Scientists

    Many current and future data scientists will be “isolated”: working alone or in small teams within a larger organization. This isolation brings certain challenges as well as freedoms. Drawing on my considerable experience both working in the professional sports industry and teaching in academia, I discuss troubled waters likely to be encountered by newly minted data scientists and offer advice about how to navigate them. Neither the issues raised nor the advice given is particular to sports, and both should be applicable to a wide range of knowledge domains.

    Greater data science at baccalaureate institutions

    Donoho's JCGS (in press) paper is a spirited call to action for statisticians, who he points out are losing ground in the field of data science by refusing to accept that data science is its own domain. (Or, at least, a domain that is becoming distinctly defined.) He calls on writings by John Tukey, Bill Cleveland, and Leo Breiman, among others, to remind us that statisticians have been dealing with data science for years, and encourages acceptance of the direction of the field while also ensuring that statistics is tightly integrated. As faculty at baccalaureate institutions (where the growth of undergraduate statistics programs has been dramatic), we are keen to ensure statistics has a place in data science and data science education. In his paper, Donoho is primarily focused on graduate education. At our undergraduate institutions, we are considering many of the same questions. Comment: in press response to Donoho paper in Journal of Computational and Graphical Statistics.

    The Impact of College Athletic Success on Donations and Applicant Quality

    For the 65 colleges and universities that participate in the Power Five athletic conferences (Pac 12, Big 10, SEC, ACC, and Big 12), the football and men’s basketball teams are highly visible. While these programs generate tens of millions of dollars in revenue annually, very few of them turn an operating “profit.” Their existence is thus justified by the claim that athletic success leads to ancillary benefits for the academic institution, in terms of both quantity (e.g., more applications, donations, and state funding) and quality (e.g., stronger applicants, lower acceptance rates, higher yields). Previous studies provide only weak support for some of these claims. Using data from 2006–2016 and a multiple regression model with corrections for multiple testing, we find that while a successful football program is associated with more applicants, there is no effect on the composition of the student body or (with a few caveats) funding for the school through donations or state appropriations.

    Quantifying Market Inefficiencies in the Baseball Players’ Market

    Among the central arguments of the bestselling book and movie Moneyball was the allegation that the labor market for baseball players was inefficient in 2002. At that time, Billy Beane and the Oakland Athletics used observations made by statistical analysts to exploit this market inefficiency, and acquire productive players on the cheap. Econometric analysis published in 2006 and 2007 confirmed the presence of an inefficient market for baseball players, but left open the question of to what extent, and how quickly, a market correction would occur. We find that this market had in fact already corrected by 2006, and moreover argue that the perceived market response to Moneyball in 2004 is properly viewed as part of a more gradual longer-term trend. In addition, we use official payroll data from Major League Baseball to refute a previous observation that the relationship between team payroll and performance has tightened since the publication of Moneyball.

    Average Case Network Lifetime on an Interval with Adjustable Sensing Ranges

    Given n sensors on an interval, each of which is equipped with an adjustable sensing radius and a unit battery charge that drains in inverse linear proportion to its radius, what schedule will maximize the lifetime of a network that covers the entire interval? Trivially, any reasonable algorithm is at least a 2-approximation for this Sensor Strip Cover problem, so we focus on developing an efficient algorithm that maximizes the expected network lifetime under a random uniform model of sensor distribution. We demonstrate one such algorithm that achieves an expected network lifetime within 12% of the theoretical maximum. Most of the algorithms that we consider come from a particular family of RoundRobin coverage, in which sensors take turns covering predefined areas until their battery runs out.
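    The 2-approximation remark in the abstract above can be illustrated with a minimal sketch. It is not the paper's algorithm. It assumes the interval is [0, 1], that a sensor set to radius r exhausts its unit battery after 1/r time units, and that each sensor in turn covers the whole interval by itself (the simplest possible round-robin schedule). The function names are hypothetical. Under these assumptions a sensor covers length at most 2r for 1/r time units, so n sensors can keep the unit interval covered for at most 2n time units, while every solo turn lasts at least 1 time unit.

```python
from typing import Iterable


def single_cover_lifetime(position: float) -> float:
    """Time one sensor at `position` can cover all of [0, 1] alone.

    Assumption: a unit battery lasts 1 / radius time units.
    """
    if not 0.0 <= position <= 1.0:
        raise ValueError("sensor position must lie in [0, 1]")
    # The radius must reach the farther endpoint of the interval.
    radius = max(position, 1.0 - position)
    return 1.0 / radius


def round_robin_lifetime(positions: Iterable[float]) -> float:
    """Total lifetime when sensors take turns covering [0, 1] alone."""
    return sum(single_cover_lifetime(x) for x in positions)


def lifetime_upper_bound(n: int) -> float:
    """Upper bound on any schedule's lifetime under the assumptions above.

    Each sensor supplies at most 2r * (1 / r) = 2 units of length-time,
    and covering a unit interval for time T consumes T units.
    """
    return 2.0 * n


if __name__ == "__main__":
    sensors = [0.1, 0.4, 0.5, 0.9]  # illustrative positions only
    achieved = round_robin_lifetime(sensors)
    bound = lifetime_upper_bound(len(sensors))
    print(f"round-robin lifetime: {achieved:.3f}, upper bound: {bound:.1f}")
```

    Because the required radius never exceeds 1, each turn contributes at least 1 time unit, so this naive schedule always achieves at least n, which is half of the 2n bound.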

    Creating Optimal Conditions for Reproducible Data Analysis in R with ‘Fertile’

    The advancement of scientific knowledge increasingly depends on ensuring that data-driven research is reproducible: that two people with the same data obtain the same results. However, while the necessity of reproducibility is clear, there are significant behavioral and technical challenges that impede its widespread implementation and no clear consensus on standards of what constitutes reproducibility in published research. We present fertile, an R package that focuses on a series of common mistakes programmers make while conducting data science projects in R, primarily through the RStudio integrated development environment. fertile operates in two modes: proactively, to prevent reproducibility mistakes from happening in the first place, and retroactively, analyzing code that is already written for potential problems. Furthermore, fertile is designed to educate users on why their mistakes are problematic and how to fix them.

    How Often Does the Best Team Win? A Unified Approach to Understanding Randomness in North American Sport

    Statistical applications in sports have long centered on how to best separate signal (e.g., team talent) from random noise. However, most of this work has concentrated on a single sport, and the development of meaningful cross-sport comparisons has been impeded by the difficulty of translating luck from one sport to another. In this manuscript, we develop Bayesian state-space models using betting market data that can be uniformly applied across sporting organizations to better understand the role of randomness in game outcomes. These models can be used to extract estimates of team strength, the between-season, within-season, and game-to-game variability of team strengths, as well as each team’s home advantage. We implement our approach across a decade of play in each of the National Football League (NFL), National Hockey League (NHL), National Basketball Association (NBA), and Major League Baseball (MLB), finding that the NBA demonstrates both the largest dispersion in talent and the largest home advantage, while the NHL and MLB stand out for their relative randomness in game outcomes. We conclude by proposing new metrics for judging competitiveness across sports leagues, both within the regular season and using traditional postseason tournament formats. Although we focus on sports, we discuss a number of other situations in which our generalizable models might be usefully applied.